Transitioning from serial CPU programming to GPU programming requires a paradigm shift: from element-wise iteration to block-based execution. We no longer view data as a stream of scalars, but as collections of "blocks" scheduled to saturate hardware bandwidth.
1. Memory-Bound vs. Compute-Bound
A kernel's bottleneck is determined by its arithmetic intensity: the ratio of math operations to bytes moved. Vector add is memory-bound because it performs only one addition for every three memory operations (two loads, one store). The hardware spends more time waiting for DRAM than calculating.
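This ratio can be checked with a back-of-the-envelope roofline estimate. The sketch below uses hypothetical peak figures (300 GB/s DRAM bandwidth, 10 TFLOP/s compute) purely for illustration:

```python
def arithmetic_intensity(flops, bytes_moved):
    """FLOPs performed per byte of DRAM traffic."""
    return flops / bytes_moved

# Per float32 element of vector add: 1 addition, and
# 2 loads + 1 store of 4-byte floats = 12 bytes of traffic.
ai = arithmetic_intensity(flops=1, bytes_moved=12)  # ~0.083 FLOP/byte

# Hypothetical GPU: 300 GB/s DRAM, 10 TFLOP/s peak math throughput.
peak_bw = 300e9      # bytes/s
peak_flops = 10e12   # FLOP/s
ridge_point = peak_flops / peak_bw  # ~33 FLOP/byte

# Vector add's intensity sits far below the ridge point, so it is
# firmly memory-bound: runtime ~ bytes_moved / peak_bw.
print(ai < ridge_point)
```

Any kernel whose intensity falls below the ridge point is limited by memory bandwidth, not by math throughput.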
2. The Role of BLOCK_SIZE
BLOCK_SIZE defines the granularity of parallelism. If it is too small, we underutilize the GPU's wide execution lanes; if it is too large, per-block resource pressure can limit how many blocks are resident at once. An optimal size keeps enough "work in flight" to saturate the memory bus.
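The mechanics of block-based partitioning can be sketched in plain Python. Each block owns BLOCK_SIZE contiguous elements, the grid size comes from a ceiling division, and the final block masks out-of-bounds indices (the function names here are illustrative, not a real launch API):

```python
def grid(n, block_size):
    """Number of blocks needed to cover n elements (ceiling division)."""
    return (n + block_size - 1) // block_size

def block_indices(pid, block_size, n):
    """Element indices owned by block `pid`, masked to stay in bounds."""
    start = pid * block_size
    return [i for i in range(start, start + block_size) if i < n]

n, BLOCK_SIZE = 10, 4
num_blocks = grid(n, BLOCK_SIZE)  # 3 blocks: [0-3], [4-7], [8-9]
coverage = [i for pid in range(num_blocks)
            for i in block_indices(pid, BLOCK_SIZE, n)]
assert coverage == list(range(n))  # every element covered exactly once
```

The mask on the last block is what lets BLOCK_SIZE be tuned freely without requiring the problem size to be an exact multiple of it.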
3. Latency Hiding through Occupancy
Occupancy is the fraction of the GPU's resident-block (or warp) slots that are filled with active work, not a raw block count. While high occupancy is not the ultimate goal, it gives the scheduler other resident blocks to switch to, performing math while another block waits on a high-latency memory fetch from VRAM.
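A toy model makes the latency-hiding arithmetic concrete. Assume (hypothetically) that each block pays a fixed memory latency and then does a fixed amount of math; the number of resident blocks needed to keep the math pipeline busy is just the ratio of the two:

```python
def blocks_needed_to_hide(mem_latency, compute_cycles):
    """Resident blocks required so math fully overlaps memory latency
    in this simplified model (ceiling division)."""
    return -(-mem_latency // compute_cycles)

def time_to_finish(num_blocks, mem_latency, compute_cycles):
    """Total cycles when latency is fully hidden: only the first
    block's memory wait is exposed; the rest overlap with math."""
    return mem_latency + num_blocks * compute_cycles

# Illustrative numbers: 400-cycle DRAM latency, 20 cycles of math per block.
print(blocks_needed_to_hide(400, 20))   # 20 resident blocks suffice
print(time_to_finish(100, 400, 20))     # 2400 cycles for 100 blocks
```

Real GPUs interleave at warp granularity with variable latencies, but the takeaway holds: the longer the memory latency relative to the math, the more work must be in flight.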
4. Hardware Utilization
To maximize performance, we must align our BLOCK_SIZE with the GPU architecture's memory coalescing rules, ensuring that consecutive threads access consecutive memory addresses.
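The cost of breaking coalescing can be counted directly. In this sketch (addresses in elements; the 32-thread warp and 128-byte transaction segment are common GPU figures, used here as assumptions), a unit-stride pattern hits one memory segment per warp, while a large stride scatters the same 32 loads across 32 segments:

```python
WARP = 32  # threads per warp (typical, assumed here)

def addresses(base, stride):
    """Element index touched by each thread of one warp."""
    return [base + t * stride for t in range(WARP)]

def segments_touched(addrs, bytes_per_elem=4, segment_bytes=128):
    """Distinct 128-byte memory segments a warp's float32 loads hit."""
    return len({(a * bytes_per_elem) // segment_bytes for a in addrs})

# Consecutive threads, consecutive elements: one 128-byte transaction.
print(segments_touched(addresses(0, stride=1)))   # 1 segment: coalesced

# Stride-32 access: every load lands in its own segment.
print(segments_touched(addresses(0, stride=32)))  # 32 segments: 32x traffic
```

Keeping BLOCK_SIZE a multiple of the warp width, with thread t reading element base + t, is what guarantees the coalesced case.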